Search | VHL Regional Portal

BioC interoperability track overview.

Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie; Dogan, Rezarta Islamaj; Yepes, Antonio Jimeno; Khare, Ritu; Lu, Zhiyong; Marques, Hernani; Mattingly, Carolyn J; Neves, Mariana; Peng, Yifan; Rak, Rafal; Rinaldi, Fabio; Tsai, Richard Tzong-Han; Verspoor, Karin; Wiegers, Thomas C; Wu, Cathy H; Wilbur, W John.

Database (Oxford) ; 2014, 2014.

Article in English | MEDLINE | ID: mdl-24980129

ABSTRACT

BioC is a new, simple XML format for sharing biomedical text and annotations, together with libraries to read and write that format. This promotes the development of interoperable tools for natural language processing (NLP) of biomedical text. The interoperability track at the BioCreative IV workshop featured contributions using or highlighting the BioC format. These contributions included additional implementations of BioC, many new corpora in the format, biomedical NLP tools consuming and producing the format and online services using the format. The ease of use, broad support and rapidly growing number of tools demonstrate the need for and value of the BioC format. Database URL: http://bioc.sourceforge.net/.
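
To make the idea of a shared XML format for text and annotations concrete, the following is a minimal sketch in Python using only the standard library. The element names (collection, document, passage, annotation, infon, location) follow a common reading of the BioC layout and should be treated as assumptions rather than a definitive rendering of the format; the function names, the sample sentence and the "Disease" label are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, text, annotations):
    """Build a BioC-style collection with one document and one passage.

    annotations: list of (annotation_id, type_label, offset, length) tuples.
    Element names are assumed, not taken from the abstract.
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "example"
    ET.SubElement(collection, "date").text = "2014"
    ET.SubElement(collection, "key").text = "example.key"

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id

    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text

    for ann_id, ann_type, offset, length in annotations:
        ann = ET.SubElement(passage, "annotation", id=ann_id)
        ET.SubElement(ann, "infon", key="type").text = ann_type
        ET.SubElement(ann, "location", offset=str(offset), length=str(length))
        # Stand-off annotation: the covered text is copied for readability.
        ET.SubElement(ann, "text").text = text[offset:offset + length]
    return collection


def read_annotations(xml_string):
    """Return (id, type, offset, length, text) for every annotation element."""
    root = ET.fromstring(xml_string)
    found = []
    for ann in root.iter("annotation"):
        loc = ann.find("location")
        found.append((ann.get("id"), ann.findtext("infon"),
                      int(loc.get("offset")), int(loc.get("length")),
                      ann.findtext("text")))
    return found


sample_text = "BRCA1 mutations are linked to breast cancer."
xml_string = ET.tostring(
    build_collection("doc1", sample_text, [("T1", "Disease", 30, 13)]),
    encoding="unicode",
)
round_trip = read_annotations(xml_string)
```

Because annotations refer to the text by offset and length, one tool can write the file and a different tool can read it back without sharing any code, which is the interoperability point the abstract makes.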

Subjects

Computational Biology , Data Mining , Natural Language Processing , Software , Biomedical Research , Database Management Systems , Databases, Factual , Internet

NCBI disease corpus: a resource for disease name recognition and concept normalization.

Dogan, Rezarta Islamaj; Leaman, Robert; Lu, Zhiyong.

J Biomed Inform ; 47: 1-10, 2014 Feb.

Article in English | MEDLINE | ID: mdl-24393765

ABSTRACT

Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information; however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. In order to help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets.
To demonstrate its utility, we conducted a benchmarking experiment where we compared three different knowledge-based disease normalization methods with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/.

Subjects

Genetic Diseases, Inborn/genetics , Knowledge Bases , PubMed , Artificial Intelligence , Computational Biology , Data Mining , Humans , Information Storage and Retrieval , Medical Subject Headings , National Institutes of Health (U.S.) , Natural Language Processing , Semantics , Terminology as Topic , United States , Vocabulary, Controlled

The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text.

Krallinger, Martin; Vazquez, Miguel; Leitner, Florian; Salgado, David; Chatr-Aryamontri, Andrew; Winter, Andrew; Perfetto, Livia; Briganti, Leonardo; Licata, Luana; Iannuccelli, Marta; Castagnoli, Luisa; Cesareni, Gianni; Tyers, Mike; Schneider, Gerold; Rinaldi, Fabio; Leaman, Robert; Gonzalez, Graciela; Matos, Sergio; Kim, Sun; Wilbur, W John; Rocha, Luis; Shatkay, Hagit; Tendulkar, Ashish V; Agarwal, Shashank; Liu, Feifan; Wang, Xinglong; Rak, Rafal; Noto, Keith; Elkan, Charles; Lu, Zhiyong; Dogan, Rezarta Islamaj; Fontaine, Jean-Fred; Andrade-Navarro, Miguel A; Valencia, Alfonso.

BMC Bioinformatics ; 12 Suppl 8: S3, 2011 Oct 03.

Article in English | MEDLINE | ID: mdl-22151929

ABSTRACT

BACKGROUND: Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. For this purpose, the BCIII-ACT corpus was provided, which includes a training, development and test set of over 12,000 PPI relevant and non-relevant PubMed abstracts labeled manually by domain experts, also recording the human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them. RESULTS: A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89% and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches.
For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, the best MCC score 0.55. In case of competitive systems with an acceptable recall (above 35%) the macro-averaged precision ranged between 50% and 80%, with a maximum F-Score of 55%. CONCLUSIONS: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
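
The abstract reports both accuracy and the Matthews correlation coefficient (MCC) and notes that unbalanced article collections remain challenging. The sketch below, in Python, shows why the two numbers can diverge on imbalanced data. The confusion-matrix counts are hypothetical and are not taken from the BioCreative III results; the function names are illustrative.

```python
import math


def accuracy(tp, fp, tn, fn):
    """Fraction of all articles that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal is empty, a common convention for the
    otherwise undefined 0/0 case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical collection: 100 abstracts, only 10 of them PPI-relevant.
# Classifier A labels everything "non-relevant".
acc_a = accuracy(tp=0, fp=0, tn=90, fn=10)
mcc_a = mcc(tp=0, fp=0, tn=90, fn=10)

# Classifier B finds half of the relevant abstracts, with 5 false alarms.
acc_b = accuracy(tp=5, fp=5, tn=85, fn=5)
mcc_b = mcc(tp=5, fp=5, tn=85, fn=5)
```

In this toy setting both classifiers reach 90% accuracy, yet only classifier B has a positive MCC, which illustrates why a correlation-style metric is informative when the relevant class is rare.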

Subjects

Algorithms , Data Mining , Proteins/metabolism , Animals , Databases, Protein , Humans , Periodicals as Topic , PubMed

A textual representation scheme for identifying clinical relationships in patient records.

Dogan, Rezarta Islamaj; Névéol, Aurélie; Lu, Zhiyong.

Proc Int Conf Mach Learn Appl ; 2010: 995-998, 2011 Feb 04.

Article in English | MEDLINE | ID: mdl-21552455

ABSTRACT

The identification of relationships between clinical concepts in patient records is a preliminary step for many important applications in medical informatics, ranging from quality of care to hypothesis generation. In this work we describe an approach that facilitates the automatic recognition of relationships defined between two different concepts in text. Unlike the traditional bag-of-words representation, in this work, a relationship is represented with a scheme of five distinct context-blocks based on the position of concepts in the text. This scheme was applied to eight different relationships, between medical problems, treatments and tests, on a set of 349 patient records from the 4th i2b2 challenge. Results show that the context-block representation was very successful (F-Measure = 0.775) compared to the bag-of-words model (F-Measure = 0.402). The advantage of this representation scheme was the correct management of word position information, which may be critical in identifying certain relationships.
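
As an illustration of a position-based representation, the following Python sketch splits a tokenized sentence into five blocks around two concept mentions. The abstract does not say how the five context-blocks are defined, so the split used here (before, first concept, between, second concept, after) is an assumption, as are the function names and the example sentence.

```python
def context_blocks(tokens, span1, span2):
    """Split tokens into five blocks around two non-overlapping concept spans.

    Each span is (start, end) in token indices, end exclusive.
    The five-way split is an assumed reading of "context-blocks".
    """
    (s1, e1), (s2, e2) = sorted([span1, span2])
    return {
        "before": tokens[:s1],
        "concept1": tokens[s1:e1],
        "between": tokens[e1:s2],
        "concept2": tokens[s2:e2],
        "after": tokens[e2:],
    }


def block_features(blocks):
    """Tag every token with its block so that word position is preserved."""
    return {f"{name}={token.lower()}"
            for name, block in blocks.items()
            for token in block}


def bag_of_words(tokens):
    """Position-free baseline: the set of lower-cased tokens."""
    return {token.lower() for token in tokens}


sentence_a = "The patient was given aspirin for chest pain yesterday".split()
blocks_a = context_blocks(sentence_a, (4, 5), (6, 8))

# Same words, concepts in the opposite order.
sentence_b = "chest pain yesterday for The patient was given aspirin".split()
blocks_b = context_blocks(sentence_b, (0, 2), (8, 9))
```

The two toy sentences share one bag of words but produce different block features, which is the kind of word position information the abstract credits for the improvement over the bag-of-words model.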

Extracting Rx information from clinical narrative.

Mork, James G; Bodenreider, Olivier; Demner-Fushman, Dina; Dogan, Rezarta Islamaj; Lang, François-Michel; Lu, Zhiyong; Névéol, Aurélie; Peters, Lee; Shooshan, Sonya E; Aronson, Alan R.

J Am Med Inform Assoc ; 17(5): 536-9, 2010.

Article in English | MEDLINE | ID: mdl-20819859

ABSTRACT

OBJECTIVE: The authors used the i2b2 Medication Extraction Challenge to evaluate their entity extraction methods, contribute to the generation of a publicly available collection of annotated clinical notes, and start developing methods for ontology-based reasoning using structured information generated from the unstructured clinical narrative. DESIGN: Extraction of salient features of medication orders from the text of de-identified hospital discharge summaries was addressed with a knowledge-based approach using simple rules and lookup lists. The entity recognition tool, MetaMap, was combined with dose, frequency, and duration modules specifically developed for the Challenge as well as a prototype module for reason identification. MEASUREMENTS: Evaluation metrics and corresponding results were provided by the Challenge organizers. RESULTS: The results indicate that robust rule-based tools achieve satisfactory results in extraction of simple elements of medication orders, but more sophisticated methods are needed for identification of reasons for the orders and durations. LIMITATIONS: Owing to the time constraints and nature of the Challenge, some obvious follow-on analysis has not been completed yet. CONCLUSIONS: The authors plan to integrate the new modules with MetaMap to enhance its accuracy. This integration effort will provide guidance in retargeting existing tools for better processing of clinical text.

Subjects

Electronic Health Records , Information Storage and Retrieval/methods , Natural Language Processing , Pharmaceutical Preparations , Humans , Patient Discharge , Software Design

Author keywords in biomedical journal articles.

Névéol, Aurélie; Dogan, Rezarta Islamaj; Lu, Zhiyong.

AMIA Annu Symp Proc ; 2010: 537-41, 2010 Nov 13.

Article in English | MEDLINE | ID: mdl-21347036

ABSTRACT

As an information retrieval system, PubMed® aims at providing efficient access to documents cited in MEDLINE®. For this purpose, it relies on matching representations of documents, as provided by authors and indexers, to user queries. In this paper, we describe the growth of author keywords in biomedical journal articles and present a comparative study of author keywords and MeSH® indexing terms assigned by MEDLINE indexers to PubMed Central Open Access articles. A similarity metric is used to assess automatically the relatedness between pairs of author keywords and indexing terms. A set of 300 pairs is manually reviewed to evaluate the metric and characterize the relationships between author keywords and indexing terms. Results show that author keywords are increasingly available in biomedical articles and that over 60% of author keywords can be linked to a closely related indexing term. Finally, we discuss the potential impact of this work on indexing and terminology development.

Subjects

MEDLINE , Medical Subject Headings , Humans , PubMed

Features generated for computational splice-site prediction correspond to functional elements.

Dogan, Rezarta Islamaj; Getoor, Lise; Wilbur, W John; Mount, Stephen M.

BMC Bioinformatics ; 8: 410, 2007 Oct 24.

Article in English | MEDLINE | ID: mdl-17958908

ABSTRACT

BACKGROUND: Accurate selection of splice sites during the splicing of precursors to messenger RNA requires both relatively well-characterized signals at the splice sites and auxiliary signals in the adjacent exons and introns. We previously described a feature generation algorithm (FGA) that is capable of achieving high classification accuracy on human 3' splice sites. In this paper, we extend the splice-site prediction to 5' splice sites and explore the generated features for biologically meaningful splicing signals. RESULTS: We present examples from the observed features that correspond to known signals, both core signals (including the branch site and pyrimidine tract) and auxiliary signals (including GGG triplets and exon splicing enhancers). We present evidence that features identified by FGA include splicing signals not found by other methods. CONCLUSION: Our generated features capture known biological signals in the expected sequence interval flanking splice sites. The method can be easily applied to other species and to similar classification problems, such as tissue-specific regulatory elements, polyadenylation sites, promoters, etc.

Subjects

Computational Biology/methods , RNA Splice Sites/physiology , Computational Biology/trends , Humans , Predictive Value of Tests , RNA, Messenger/physiology

Structural footprinting in protein structure comparison: the impact of structural fragments.

Zotenko, Elena; Dogan, Rezarta Islamaj; Wilbur, W John; O'Leary, Dianne P; Przytycka, Teresa M.

BMC Struct Biol ; 7: 53, 2007 Aug 09.

Article in English | MEDLINE | ID: mdl-17688700

ABSTRACT

BACKGROUND: One approach for speeding-up protein structure comparison is the projection approach, where a protein structure is mapped to a high-dimensional vector and structural similarity is approximated by distance between the corresponding vectors. Structural footprinting methods are projection methods that employ the same general technique to produce the mapping: first select a representative set of structural fragments as models and then map a protein structure to a vector in which each dimension corresponds to a particular model and "counts" the number of times the model appears in the structure. The main difference between any two structural footprinting methods is in the set of models they use; in fact a large number of methods can be generated by varying the type of structural fragments used and the amount of detail in their representation. How do these choices affect the ability of the method to detect various types of structural similarity? RESULTS: To answer this question we benchmarked three structural footprinting methods that vary significantly in their selection of models against the CATH database. In the first set of experiments we compared the methods' ability to detect structural similarity characteristic of evolutionarily related structures, i.e., structures within the same CATH superfamily. In the second set of experiments we tested the methods' agreement with the boundaries imposed by classification groups at the Class, Architecture, and Fold levels of the CATH hierarchy. CONCLUSION: In both experiments we found that the method which uses secondary structure information has the best performance on average, but no one method performs consistently the best across all groups at a given classification level. We also found that combining the methods' outputs significantly improves the performance. 
Moreover, our new techniques to measure and visualize the methods' agreement with the CATH hierarchy, including the thresholded affinity graph, are useful beyond this work. In particular, they can be used to expose a similar composition of different classification groups in terms of structural fragments used by the method and thus provide an alternative demonstration of the continuous nature of the protein structure universe.
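
The projection idea described in the abstract (count how often each model fragment appears, then compare the count vectors) can be sketched in Python as follows. This is a toy stand-in: real structural footprinting works on three-dimensional fragments, whereas here a structure is a hypothetical secondary-structure string (H for helix, E for strand) and the models are short substrings chosen purely for illustration.

```python
import math


def count_occurrences(structure, model):
    """Count overlapping occurrences of a model fragment in a structure string."""
    width = len(model)
    return sum(1 for i in range(len(structure) - width + 1)
               if structure[i:i + width] == model)


def footprint(structure, models):
    """Map a structure to a vector with one count per model fragment."""
    return [count_occurrences(structure, model) for model in models]


def euclidean_distance(u, v):
    """Distance between two footprints; a smaller value suggests more similarity."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical model set and structures (not from the benchmarked methods).
models = ["HHH", "EEE", "HE", "EH"]
vector_a = footprint("HHHHEEEE", models)  # helix followed by strand
vector_b = footprint("EEEEHHHH", models)  # strand followed by helix
distance_ab = euclidean_distance(vector_a, vector_b)
```

Changing the model set changes which differences the footprint can see: with only "HHH" and "EEE" as models the two toy structures would be indistinguishable, which mirrors the abstract's question about how the choice of models affects the similarities a method can detect.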

Subjects

Protein Structure, Secondary , Protein Structure, Tertiary , Sequence Analysis, Protein , Databases, Protein , Models, Chemical , Sequence Analysis, Protein/statistics & numerical data , Structural Homology, Protein

SplicePort--an interactive splice-site analysis tool.

Dogan, Rezarta Islamaj; Getoor, Lise; Wilbur, W John; Mount, Stephen M.

Nucleic Acids Res ; 35(Web Server issue): W285-91, 2007 Jul.

Article in English | MEDLINE | ID: mdl-17576680

ABSTRACT

SplicePort is a web-based tool for splice-site analysis that allows the user to make splice-site predictions for submitted sequences. In addition, the user can also browse the rich catalog of features that underlies these predictions, and which we have found capable of providing high classification accuracy on human splice sites. Feature selection is optimized for human splice sites, but the selected features are likely to be predictive for other mammals as well. With our interactive feature browsing and visualization tool, the user can view and explore subsets of features used in splice-site prediction (either the features that account for the classification of a specific input sequence or the complete collection of features). Selected feature sets can be searched, ranked or displayed easily. The user can group features into clusters and frequency plot WebLogos can be generated for each cluster. The user can browse the identified clusters and their contributing elements, looking for new interesting signals, or can validate previously observed signals. The SplicePort web server can be accessed at http://www.cs.umd.edu/projects/SplicePort and http://www.spliceport.org.

Subjects

Chromosome Mapping/methods , Computational Biology/methods , DNA/genetics , Models, Genetic , Pattern Recognition, Automated/methods , RNA Splice Sites/genetics , Sequence Analysis, DNA/methods , Base Sequence , Computer Simulation , Genome , Humans , Internet , Molecular Sequence Data , Sequence Alignment/methods , User-Computer Interface
